R. J. Serrano
03/02/2023
Learning objectives:
Develop linear models with time series.
Developing useful predictors for linear models.
Residual diagnostics.
Selecting predictors and forecast evaluation.
Forecasting with regression.
Matrix formulation.
Correlation, causation and forecasting.
In this chapter, we will discuss regression models. The basic concept is that we forecast the time series of interest \(y\) assuming that it has a linear relationship with other time series \(x\).
In the simplest case, the regression model allows for a linear relationship between the forecast variable \(y\) and a single predictor \(x\):
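In standard notation (the usual formulation of simple linear regression, not shown explicitly in these notes), this model can be written as:

```latex
y_t = \beta_0 + \beta_1 x_t + \varepsilon_t
```

where the intercept \(\beta_0\) is the predicted value of \(y\) when \(x = 0\), the slope \(\beta_1\) is the average change in \(y\) resulting from a one-unit increase in \(x\), and \(\varepsilon_t\) is the error term.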
Figure 7.1: An example of data from a simple linear regression model
Let’s plot the time series of quarterly changes (growth rates) of real personal consumption expenditure (\(y\)) and real personal disposable income (\(x\)) for the U.S. from 1970 Q1 to 2019 Q2.
us_change %>%
pivot_longer(c(Consumption, Income),
names_to = 'Series') %>%
autoplot(value) +
labs(y = '% change')
Using the lm base function to calculate intercept and slope coefficients.
us_chg_tbl <- us_change %>%
as_tibble() %>%
select(Consumption, Income)
model_lm <- lm(Consumption ~ Income, data = us_chg_tbl)
summary(model_lm)
##
## Call:
## lm(formula = Consumption ~ Income, data = us_chg_tbl)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.58236 -0.27777 0.01862 0.32330 1.42229
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.54454 0.05403 10.079 < 2e-16 ***
## Income 0.27183 0.04673 5.817 2.4e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5905 on 196 degrees of freedom
## Multiple R-squared: 0.1472, Adjusted R-squared: 0.1429
## F-statistic: 33.84 on 1 and 196 DF, p-value: 2.402e-08
Scatterplot with fitted regression line.
us_change |>
ggplot(aes(x = Income, y = Consumption)) +
labs(y = "Consumption (quarterly % change)",
x = "Income (quarterly % change)") +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
Cross-correlation between the two time series.
The cross-correlation plot can be used to identify lagged values of a predictor that may be useful for forecasting.
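As a minimal sketch of what a cross-correlation estimate looks like, base R's `ccf()` function (from the stats package) computes the cross-correlation between two series over a range of lags. The data here are simulated purely for illustration: `y` follows `x` with a two-period delay, so the strongest cross-correlation should appear at lag \(-2\).

```r
# Simulated example: y lags x by 2 periods
set.seed(42)
x <- rnorm(200)
y <- c(rep(0, 2), head(x, -2)) + rnorm(200, sd = 0.5)

# ccf(x, y) estimates cor(x[t + k], y[t]) for lags k = -10..10
cc <- ccf(x, y, lag.max = 10, plot = FALSE)

# Lag with the largest absolute cross-correlation
best_lag <- cc$lag[which.max(abs(cc$acf))]
best_lag
```

In practice the candidate lags identified this way can then be tried as predictors in the regression.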
us_change |>
# drop `Consumption`, `Income` columns
select(-Consumption, -Income) |>
pivot_longer(-Quarter) |>
ggplot(aes(Quarter, value, colour = name)) +
geom_line() +
facet_grid(name ~ ., scales = "free_y") +
guides(colour = "none") +
labs(y = "% change")
Pair plots for us_change
library(GGally)
us_change %>%
ggpairs(columns = 2:6,
lower = list(
continuous = wrap('smooth',
color = 'steelblue',
alpha = 0.3,
size = 0.3)
)
)
We make the following assumptions about the errors:
They have mean zero; otherwise the forecasts will be systematically biased.
They are not autocorrelated; otherwise the forecasts will be inefficient, as there is more information in the data that can be exploited.
They are unrelated to the predictor variables; otherwise there would be more information that should be included in the systematic part of the model.
The residuals follow a Gaussian (normal) distribution with a constant variance \(\sigma^2\).
The least squares principle provides a way of choosing the coefficients effectively by minimising the sum of the squared errors.
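For a single predictor, the coefficients that minimise the sum of squared errors have a well-known closed form: \(\hat\beta_1 = \operatorname{cov}(x, y)/\operatorname{var}(x)\) and \(\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}\). The sketch below (with made-up data, not the us_change series) checks these formulas against lm():

```r
# Simulated data: y = 2 + 3x + noise
set.seed(123)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)

# Closed-form least squares estimates for one predictor
slope     <- cov(x, y) / var(x)
intercept <- mean(y) - slope * mean(x)

# lm() minimises the same sum of squared errors,
# so the two approaches agree
fit <- lm(y ~ x)
c(intercept = intercept, slope = slope)
coef(fit)
```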
fit_consMR <- us_change |>
model(tslm = TSLM(Consumption ~ Income + Production +
Unemployment + Savings))
report(fit_consMR)
## Series: Consumption
## Model: TSLM
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.90555 -0.15821 -0.03608 0.13618 1.15471
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.253105 0.034470 7.343 5.71e-12 ***
## Income 0.740583 0.040115 18.461 < 2e-16 ***
## Production 0.047173 0.023142 2.038 0.0429 *
## Unemployment -0.174685 0.095511 -1.829 0.0689 .
## Savings -0.052890 0.002924 -18.088 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3102 on 193 degrees of freedom
## Multiple R-squared: 0.7683, Adjusted R-squared: 0.7635
## F-statistic: 160 on 4 and 193 DF, p-value: < 2.22e-16
augment(fit_consMR) |>
ggplot(aes(x = Quarter)) +
geom_line(aes(y = Consumption, colour = "Data")) +
geom_line(aes(y = .fitted, colour = "Fitted")) +
labs(y = NULL,
title = "Percent change in US consumption expenditure"
) +
scale_colour_manual(values=c(Data="black",Fitted="#D55E00")) +
guides(colour = guide_legend(title = NULL))
augment(fit_consMR) |>
ggplot(aes(x = Consumption, y = .fitted)) +
geom_point() +
labs(
y = "Fitted (predicted values)",
x = "Data (actual values)",
title = "Percent change in US consumption expenditure"
) +
geom_abline(intercept = 0, slope = 1, color = 'red')
A common way to summarise how well a linear regression model fits the data is via the coefficient of determination, \(R^2\). For the example above, \(R^2 = 0.768\).
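For a linear regression fitted by least squares (with an intercept), \(R^2\) equals the squared correlation between the actual and fitted values, which is easy to verify directly. The sketch below uses simulated data rather than the model above:

```r
# R^2 equals the squared correlation between actual and
# fitted values in an OLS regression with an intercept
set.seed(1)
x <- rnorm(50)
y <- 1 + 0.5 * x + rnorm(50)
fit <- lm(y ~ x)

r_squared <- cor(y, fitted(fit))^2
summary(fit)$r.squared  # same value
```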
Residual plots
Using the gg_tsresiduals() function introduced in
Section 5.3, we can obtain all the useful residual diagnostics mentioned
above.
The time plot shows some changing variation over time, but is otherwise relatively unremarkable. This heteroscedasticity will potentially make the prediction interval coverage inaccurate.
The histogram shows that the residuals seem to be slightly skewed, which may also affect the coverage probability of the prediction intervals.
The autocorrelation plot shows a significant spike at lag 7, and a significant Ljung-Box test at the 5% level. However, the autocorrelation is not particularly large, and at lag 7 it is unlikely to have any noticeable impact on the forecasts or the prediction intervals.
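The Ljung-Box test mentioned above can be run on any residual series with base R's Box.test() (in fpp3, the same test is available through the features() interface). A minimal sketch on simulated white-noise residuals, where we would expect no significant autocorrelation:

```r
# Ljung-Box test for residual autocorrelation (base R).
# White-noise residuals should typically give a large p-value.
set.seed(7)
resid_sim <- rnorm(200)
lb <- Box.test(resid_sim, lag = 10, type = "Ljung-Box")
lb$p.value
```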
The residuals from the multiple regression model for forecasting US consumption plotted against each predictor in Figure 7.9 seem to be randomly scattered. Therefore we are satisfied with these in this case.
us_change |>
left_join(residuals(fit_consMR), by = "Quarter") |>
pivot_longer(Income:Unemployment,
names_to = "regressor", values_to = "x") |>
ggplot(aes(x = x, y = .resid)) +
geom_point() +
facet_wrap(. ~ regressor, scales = "free_x") +
labs(y = "Residuals", x = "")
A plot of the residuals against the fitted values should also show no pattern. If a pattern is observed, there may be “heteroscedasticity” in the errors, which means that the variance of the residuals may not be constant.
augment(fit_consMR) |>
ggplot(aes(x = .fitted, y = .resid)) +
geom_point() +
labs(x = "Fitted", y = "Residuals")
Observations that take extreme values compared to the majority of the data are called outliers. Observations that have a large influence on the estimated coefficients of a regression model are called influential observations. Usually, influential observations are also outliers that are extreme in the \(x\) direction.
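Influence can be quantified with base R's leverage and Cook's distance measures, hatvalues() and cooks.distance(). A sketch on simulated data with one deliberately extreme \(x\) value, which shows up with by far the highest leverage:

```r
# One extreme x value is an outlier in the x direction
set.seed(99)
x <- c(rnorm(30), 10)   # last point is extreme in x
y <- 1 + 2 * x + rnorm(31)
fit <- lm(y ~ x)

leverage <- hatvalues(fit)      # pull of each point on its own fit
cooks    <- cooks.distance(fit) # influence on the coefficients

which.max(leverage)  # the extreme point has the highest leverage
```

Whether a high-leverage point is also highly influential depends on its residual: a point that lies close to the fitted line can have large leverage but a modest Cook's distance.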